Learning Joint Representations of Videos and Sentences with Web Image Search

Authors

  • Mayu Otani
  • Yuta Nakashima
  • Esa Rahtu
  • Janne Heikkilä
  • Naokazu Yokoya
Abstract

Our objective is video retrieval based on natural language queries. In addition, we consider the analogous problem of retrieving sentences or generating descriptions given an input video. Recent work has addressed the problem by embedding visual and textual inputs into a common space where semantic similarities correlate with distances. We also adopt the embedding approach and make the following contributions. First, we utilize web image search in the sentence embedding process to disambiguate fine-grained visual concepts. Second, we propose embedding models for sentence, image, and video inputs whose parameters are learned simultaneously. Finally, we show how the proposed model can be applied to description generation. Overall, we observe a clear improvement over the state-of-the-art methods in the video and sentence retrieval tasks. In description generation, the performance level is comparable to the current state of the art, although our embeddings were trained for the retrieval tasks.
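As a rough illustration of retrieval in a common embedding space, the sketch below ranks videos against a sentence query by similarity of their embedding vectors. It is not the authors' model: the vectors are hand-written toy values, the names `rank_videos`, `video_embeddings`, and `query_embedding` are hypothetical, and cosine similarity is an assumed choice of similarity measure.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_videos(query_embedding, video_embeddings):
    """Return (video name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_embedding, emb))
        for name, emb in video_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumption): in a real system these would come from
# learned video and sentence encoders that map into the same space.
video_embeddings = {
    "video_dog_running": [0.9, 0.1, 0.0],
    "video_cooking_pasta": [0.1, 0.9, 0.2],
    "video_playing_guitar": [0.0, 0.2, 0.9],
}

# Toy embedding (assumption) standing in for an encoded sentence query.
query_embedding = [0.8, 0.2, 0.1]

ranking = rank_videos(query_embedding, video_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Sentence retrieval for a given video would follow the same pattern with the roles swapped: the video embedding becomes the query and candidate sentence embeddings are ranked against it.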

Similar resources

Image Classification via Sparse Representation and Subspace Alignment

Image representation is a crucial problem in image processing, where many low-level representations of images exist, e.g., SIFT, HOG and so on. But there is a missing link between low-level and high-level semantic representations. In fact, traditional machine learning approaches, e.g., non-negative matrix factorization, sparse representation and principal component analysis are employed to d...

Information-Seeking Behavior of Graduate Students at Qazvin University of Medical Sciences for Retrieving Specialized Images and Videos

Background and Aim: Technical videos and images are of great importance in learning different topics of medical sciences. This study is conducted to determine the effect of videos and images in learning from students’ point of view and also their problems in accessing them. Materials and Methods: This is a survey study. Data were collected by a self-made questionnaire and the population includ...

Improved Content Aware Image Retargeting Using Strip Partitioning

With the rapid upsurge in the demand for and usage of electronic media devices such as tablets, smartphones, laptops, and personal computers, and their differing display specifications, including size and shape, image retargeting has become one of the key components of communication technology and the internet. The existing techniques in image resizing cannot save the most valuable information of image...

Deep Learning for the Web

Deep learning is a machine learning technology that automatically extracts higher-level representations from raw data by stacking multiple layers of neuron-like units. The stacking allows for extracting representations of increasingly-complex features without time-consuming, offline feature engineering. Recent success of deep learning has shown that it outperforms state-of-the-art systems in im...

Video Annotation by Incremental Learning from Grouped Heterogeneous Sources

Transfer learning has shown promising results in leveraging loosely labeled Web images (source domain) to learn a robust classifier for unlabeled consumer videos (target domain). Existing transfer learning methods typically apply source domain data to learn a fixed model for predicting target domain data once and for all, ignoring rapidly updating Web data and continuous changes of users ...

Publication date: 2016